Source of data: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction
This dataset contains an airline passenger satisfaction survey.
| Variable | Description | Values |
|---|---|---|
| PASSENGER INFORMATION | | |
| Satisfaction (target) | Airline satisfaction level | satisfied; neutral or dissatisfied |
| Gender | Gender of the passengers | Female, Male |
| Customer.Type | The customer type | Loyal Customer, disloyal Customer |
| Age | The actual age of the passengers | |
| Type.of.Travel | Purpose of the flight of the passengers | Personal Travel, Business Travel |
| Class | Travel class in the plane of the passengers | Business, Eco, Eco Plus |
| FLIGHT INFORMATION | | |
| Flight.Distance | The flight distance of this journey | |
| Departure.Delay.in.Minutes | Minutes delayed at departure | |
| Arrival.Delay.in.Minutes | Minutes delayed at arrival | |
| SERVICE RATINGS | | |
| Inflight.wifi.service | Satisfaction level of the inflight wifi service | 0: Not Applicable; 1 - 5 |
| Departure.Arrival.time.convenient | Satisfaction level of departure/arrival time convenience | 0: Not Applicable; 1 - 5 |
| Ease.of.Online.booking | Satisfaction level of online booking | 0: Not Applicable; 1 - 5 |
| Gate.location | Satisfaction level of gate location | 1 - 5 |
| Food.and.drink | Satisfaction level of food and drink | 0: Not Applicable; 1 - 5 |
| Online.boarding | Satisfaction level of online boarding | 0: Not Applicable; 1 - 5 |
| Seat.comfort | Satisfaction level of seat comfort | 1 - 5 |
| Inflight.entertainment | Satisfaction level of inflight entertainment | 0: Not Applicable; 1 - 5 |
| On-board.service | Satisfaction level of on-board service | 0: Not Applicable; 1 - 5 |
| Leg.room.service | Satisfaction level of leg room service | 0: Not Applicable; 1 - 5 |
| Baggage.handling | Satisfaction level of baggage handling | 1 - 5 |
| Checkin.service | Satisfaction level of Check-in service | 1 - 5 |
| Inflight.service | Satisfaction level of inflight service | 0: Not Applicable; 1 - 5 |
| Cleanliness | Satisfaction level of cleanliness | 0: Not Applicable; 1 - 5 |
# overview of data attributes
str(air)
## 'data.frame': 25976 obs. of 24 variables:
## $ id : int 19556 90035 12360 77959 36875 39177 79433 97286 27508 62482 ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Customer.Type : chr "Loyal Customer" "Loyal Customer" "disloyal Customer" "Loyal Customer" ...
## $ Age : int 52 36 20 44 49 16 77 43 47 46 ...
## $ Type.of.Travel : chr "Business travel" "Business travel" "Business travel" "Business travel" ...
## $ Class : chr "Eco" "Business" "Eco" "Business" ...
## $ Flight.Distance : int 160 2863 192 3377 1182 311 3987 2556 556 1744 ...
## $ Inflight.wifi.service : int 5 1 2 0 2 3 5 2 5 2 ...
## $ Departure.Arrival.time.convenient: int 4 1 0 0 3 3 5 2 2 2 ...
## $ Ease.of.Online.booking : int 3 3 2 0 4 3 5 2 2 2 ...
## $ Gate.location : int 4 1 4 2 3 3 5 2 2 2 ...
## $ Food.and.drink : int 3 5 2 3 4 5 3 4 5 3 ...
## $ Online.boarding : int 4 4 2 4 1 5 5 4 5 4 ...
## $ Seat.comfort : int 3 5 2 4 2 3 5 5 5 4 ...
## $ Inflight.entertainment : int 5 4 2 1 2 5 5 4 5 4 ...
## $ On.board.service : int 5 4 4 1 2 4 5 4 2 4 ...
## $ Leg.room.service : int 5 4 1 1 2 3 5 4 2 4 ...
## $ Baggage.handling : int 5 4 3 1 2 1 5 4 5 4 ...
## $ Checkin.service : int 2 3 2 3 4 1 4 5 3 5 ...
## $ Inflight.service : int 5 4 2 1 2 2 5 4 3 4 ...
## $ Cleanliness : int 5 5 2 4 4 5 3 3 5 4 ...
## $ Departure.Delay.in.Minutes : int 50 0 0 0 0 0 0 77 1 28 ...
## $ Arrival.Delay.in.Minutes : num 44 0 0 6 20 0 0 65 0 14 ...
## $ satisfaction : chr "satisfied" "satisfied" "neutral or dissatisfied" "satisfied" ...
# descriptive summary of the data
summary(air)
## id Gender Customer.Type Age
## Min. : 17 Length:25976 Length:25976 Min. : 7.00
## 1st Qu.: 32170 Class :character Class :character 1st Qu.:27.00
## Median : 65320 Mode :character Mode :character Median :40.00
## Mean : 65006 Mean :39.62
## 3rd Qu.: 97584 3rd Qu.:51.00
## Max. :129877 Max. :85.00
##
## Type.of.Travel Class Flight.Distance Inflight.wifi.service
## Length:25976 Length:25976 Min. : 31 Min. :0.000
## Class :character Class :character 1st Qu.: 414 1st Qu.:2.000
## Mode :character Mode :character Median : 849 Median :3.000
## Mean :1194 Mean :2.725
## 3rd Qu.:1744 3rd Qu.:4.000
## Max. :4983 Max. :5.000
##
## Departure.Arrival.time.convenient Ease.of.Online.booking Gate.location
## Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000
## Mean :3.047 Mean :2.757 Mean :2.977
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
## Food.and.drink Online.boarding Seat.comfort Inflight.entertainment
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.215 Mean :3.262 Mean :3.449 Mean :3.358
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## On.board.service Leg.room.service Baggage.handling Checkin.service
## Min. :0.000 Min. :0.00 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.00 Median :4.000 Median :3.000
## Mean :3.386 Mean :3.35 Mean :3.633 Mean :3.314
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000
##
## Inflight.service Cleanliness Departure.Delay.in.Minutes
## Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 0.00
## Median :4.000 Median :3.000 Median : 0.00
## Mean :3.649 Mean :3.286 Mean : 14.31
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.: 12.00
## Max. :5.000 Max. :5.000 Max. :1128.00
##
## Arrival.Delay.in.Minutes satisfaction
## Min. : 0.00 Length:25976
## 1st Qu.: 0.00 Class :character
## Median : 0.00 Mode :character
## Mean : 14.74
## 3rd Qu.: 13.00
## Max. :1115.00
## NA's :83
There are several points to note from our descriptive analysis so far:

- The rating features do not all share the same scale. In some features, 0 denotes that the service was not offered on the flight and was therefore not applicable for the customer to rate. Features whose ratings range from 1 - 5 correspond to services offered on all flights.
  - 0 - 5: Inflight.wifi.service, Departure.Arrival.time.convenient, Ease.of.Online.booking, Food.and.drink, Online.boarding, Inflight.entertainment, On.board.service, Leg.room.service, Inflight.service, Cleanliness
  - 1 - 5: Gate.location, Seat.comfort, Baggage.handling, Checkin.service
- Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes contain entries of 0. These do not represent missing values; they indicate flights that experienced no departure or arrival delay.
- There are 83 NA values in Arrival.Delay.in.Minutes.
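The scale difference can be verified directly by counting zeros per rating column: a nonzero count indicates the column uses the 0 - 5 scale with 0 as Not Applicable. A minimal sketch on a toy data frame (the same one-liner applies to the rating columns of `air`):

```r
# toy frame mimicking two rating columns; the column containing zeros
# stands in for a feature where 0 means "Not Applicable"
toy <- data.frame(
  Inflight.wifi.service = c(0L, 3L, 5L, 0L, 2L),
  Gate.location         = c(1L, 4L, 2L, 5L, 3L)
)
zero_counts <- sapply(toy, function(x) sum(x == 0))
zero_counts
# Inflight.wifi.service: 2, Gate.location: 0
```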
To avoid any bias and misinterpretation of data, we will check for missing values in the dataset. If missing values exist, we will look further into each column.
# visualize missing data
vis_miss(air) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
# total number of missing values
cat("\nTotal number of missing values:", sum(is.na(air)))
##
## Total number of missing values: 83
# percentage of missing data
cat("\nPercentage of missing data:", (sum(is.na(air))/nrow(air))*100, "%")
##
## Percentage of missing data: 0.3195257 %
The visualization of missing values for the entire dataset confirms our earlier observation that NA values appear only in Arrival.Delay.in.Minutes. Since they constitute only 0.32% of the dataset, it is safe to remove all rows containing them.
# drop missing values
air <- air %>% drop_na()
cat("Number of missing values:", sum(is.na(air)))
## Number of missing values: 0
We also want to check whether there are duplicated records in the dataset. Since id is the unique key of the dataset, we can use it to identify whether any duplication exists.
# check for duplicates
sum(duplicated(air$id))
## [1] 0
# count and percentage of each categories in satisfaction
satisfaction_count <- as.data.frame(table(air$satisfaction))
satisfaction_count$percent <- satisfaction_count$Freq / sum(satisfaction_count$Freq) * 100
# pie chart
ggplot(satisfaction_count, aes(x = '', y = Freq, fill = factor(Var1))) +
geom_bar(stat = 'identity', width = 1) +
coord_polar(theta = 'y') +
theme_void() +
scale_fill_manual(values = c('#3F7ED5', '#C7DFF9'), name = '') +
labs(title = 'Pie chart of satisfaction') +
theme(legend.position = 'right',
legend.text = element_text(size = 10),
plot.title = element_text(face = 'bold', hjust = 0.5)) +
geom_text(aes(label = paste0(round(percent, 1), '%')),
position = position_stack(vjust = 0.5))
# name of categorical columns
cat_cols <- colnames(air[c('Class', 'Customer.Type', 'Gender', 'Type.of.Travel')])
for (col in cat_cols) {
# a data frame with the count of each category in the current column
cat_count <- data.frame(table(air[[col]]))
# a bar plot of the count data
print(ggplot(cat_count, aes(x = Var1, y = Freq)) +
geom_bar(stat = 'identity', fill = '#3F7ED5') +
geom_text(aes(label = Freq), vjust = -0.5) +
theme(plot.title = element_text(face = 'bold'),
axis.ticks = element_blank()) +
labs(title = col, x = '', y = "Count"))
}
# name of numerical columns
num_cols <- air %>%
select_if(function(x) length(unique(x)) > 6) %>%
colnames
# histogram for each numerical column
for (col in num_cols) {
print(ggplot(air, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.5, color = '#3F7ED5')+
labs(title = col, x = '', y = 'Frequency') +
theme(axis.ticks = element_blank(),
plot.title= element_text(face = "bold")))
}
# name of columns with rank data
rank_cols <- air %>%
select_if(~ is.numeric(.) && n_distinct(.) <= 6) %>%
colnames()
# reshape data from wide to long format
rank_long <- air %>%
pivot_longer(cols = all_of(rank_cols),
names_to = 'column_name',
values_to = 'rank')
# group data by column name and rank
# calculate the count of each rank in each column
rank_counts <- rank_long %>%
group_by(column_name, rank) %>%
summarise(count = n(), .groups = 'drop')
# plot stacked bars
ggplot(rank_counts, aes(x = column_name, y = count, fill = factor(rank))) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c('#C7DFF9', '#A4C8F0', '#7FADEB', '#5E95E0', '#3F7ED5', '#2B5FA6'),
name = 'Rank',
labels = c('0', '1', '2', '3', '4', '5')) +
labs(x = '', y = 'Count', fill = "Rank") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.ticks = element_blank(),
panel.background = element_blank())
for (col in rank_cols) {
# a data frame with the count of each category in the current column
rank_count <- data.frame(table(air[[col]]))
# a bar plot of the count data
print(ggplot(rank_count, aes(x = Var1, y = Freq)) +
geom_bar(stat = 'identity', fill = '#3F7ED5') +
geom_text(aes(label = Freq), vjust = -0.5) +
theme(plot.title = element_text(face = 'bold'),
axis.ticks = element_blank()) +
labs(title = col, x = '', y = "Count"))
}
# correlation matrix
corr_matrix <- correlate(air[,c(4, 7:23)], diagonal = 1)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
corr_matrix
## # A tibble: 18 × 19
## term Age Flight…¹ Inflig…² Depart…³ Ease.o…⁴ Gate.l…⁵ Food.a…⁶ Online…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Age 1 9.99e-2 0.00912 3.19e-2 1.37e-2 2.68e-3 0.0246 0.203
## 2 Flig… 0.0999 1 e+0 0.00460 -1.49e-2 6.20e-2 7.95e-3 0.0578 0.215
## 3 Infl… 0.00912 4.60e-3 1 3.49e-1 7.11e-1 3.48e-1 0.122 0.459
## 4 Depa… 0.0319 -1.49e-2 0.349 1 e+0 4.40e-1 4.59e-1 -0.0153 0.0809
## 5 Ease… 0.0137 6.20e-2 0.711 4.40e-1 1 e+0 4.66e-1 0.0255 0.408
## 6 Gate… 0.00268 7.95e-3 0.348 4.59e-1 4.66e-1 1 e+0 -0.00941 0.00706
## 7 Food… 0.0246 5.78e-2 0.122 -1.53e-2 2.55e-2 -9.41e-3 1 0.230
## 8 Onli… 0.203 2.15e-1 0.459 8.09e-2 4.08e-1 7.06e-3 0.230 1
## 9 Seat… 0.155 1.59e-1 0.116 -2.01e-3 2.30e-2 -1.05e-3 0.582 0.416
## 10 Infl… 0.0695 1.38e-1 0.201 -2.21e-2 4.46e-2 -4.78e-4 0.627 0.279
## 11 On.b… 0.0552 1.18e-1 0.114 6.08e-2 4.02e-2 -3.14e-2 0.0513 0.150
## 12 Leg.… 0.0330 1.37e-1 0.160 3.37e-3 1.17e-1 -2.28e-3 0.0364 0.121
## 13 Bagg… -0.0505 7.12e-2 0.119 6.56e-2 4.07e-2 -4.20e-3 0.0378 0.0846
## 14 Chec… 0.0258 7.53e-2 0.0465 8.29e-2 4.58e-4 -5.46e-2 0.0773 0.204
## 15 Infl… -0.0593 6.61e-2 0.109 6.80e-2 3.55e-2 -5.24e-3 0.0407 0.0723
## 16 Clea… 0.0489 1.06e-1 0.125 -7.07e-3 1.10e-2 -1.42e-2 0.660 0.321
## 17 Depa… -0.00545 2.40e-3 -0.0104 -2.25e-4 -1.44e-3 7.62e-3 -0.0267 -0.0225
## 18 Arri… -0.00740 1.31e-4 -0.0123 -1.35e-3 -3.16e-3 7.78e-3 -0.0283 -0.0259
## # … with 10 more variables: Seat.comfort <dbl>, Inflight.entertainment <dbl>,
## # On.board.service <dbl>, Leg.room.service <dbl>, Baggage.handling <dbl>,
## # Checkin.service <dbl>, Inflight.service <dbl>, Cleanliness <dbl>,
## # Departure.Delay.in.Minutes <dbl>, Arrival.Delay.in.Minutes <dbl>, and
## # abbreviated variable names ¹Flight.Distance, ²Inflight.wifi.service,
## # ³Departure.Arrival.time.convenient, ⁴Ease.of.Online.booking,
## # ⁵Gate.location, ⁶Food.and.drink, ⁷Online.boarding
# heatmap of correlation
corr_matrix %>%
rearrange(method = "MDS", absolute = FALSE) %>%
shave() %>%
rplot(shape = 15, colours = c("#D3D3D3", "#3F7ED5")) +
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
axis.text = element_text(size = 8),
axis.ticks = element_blank())
Before applying a dimension reduction technique to the dataset, we can inspect the correlation matrix to identify any strong or relatively strong positive or negative correlations between the numerical features.
If we set a cutoff of 0.3 for the correlation coefficient, we see that Age, Flight.Distance and Checkin.service have no correlation coefficients above 0.3 with any other variable, indicating that these 3 variables are largely independent of the rest and should be excluded from our analysis.
# set a cut-off point for correlation coefficients
cutoff <- function(x) all(abs(x) < 0.3, na.rm = TRUE)
# extract variables whose coefficients all fall below the cutoff
correlate(air[,c(4, 7:23)]) %>%
focus_if(cutoff, mirror = FALSE)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
## # A tibble: 15 × 4
## term Age Flight.Distance Checkin.service
## <chr> <dbl> <dbl> <dbl>
## 1 Inflight.wifi.service 0.00912 0.00460 0.0465
## 2 Departure.Arrival.time.convenient 0.0319 -0.0149 0.0829
## 3 Ease.of.Online.booking 0.0137 0.0620 0.000458
## 4 Gate.location 0.00268 0.00795 -0.0546
## 5 Food.and.drink 0.0246 0.0578 0.0773
## 6 Online.boarding 0.203 0.215 0.204
## 7 Seat.comfort 0.155 0.159 0.183
## 8 Inflight.entertainment 0.0695 0.138 0.115
## 9 On.board.service 0.0552 0.118 0.248
## 10 Leg.room.service 0.0330 0.137 0.151
## 11 Baggage.handling -0.0505 0.0712 0.240
## 12 Inflight.service -0.0593 0.0661 0.240
## 13 Cleanliness 0.0489 0.106 0.165
## 14 Departure.Delay.in.Minutes -0.00545 0.00240 -0.0209
## 15 Arrival.Delay.in.Minutes -0.00740 0.000131 -0.0271
Since we are going to investigate the underlying patterns of the dataset and identify the factors that are highly correlated with the satisfaction level of US airline passengers, only numerical and rank data will be used. We will also exclude the Age, Flight.Distance and Checkin.service variables, as discussed above. In a later section, we will consider a subset of all numerical/rank and categorical variables for logistic regression.
As such, several subsets will be created according to the purpose of each analysis.
# numerical and ranked variables
# with satisfaction variable
df <- air[,c(8:18, 20:24)]
cat("Subset df\nDimension:", dim(df), "\n\n")
## Subset df
## Dimension: 25893 16
colnames(df)
## [1] "Inflight.wifi.service" "Departure.Arrival.time.convenient"
## [3] "Ease.of.Online.booking" "Gate.location"
## [5] "Food.and.drink" "Online.boarding"
## [7] "Seat.comfort" "Inflight.entertainment"
## [9] "On.board.service" "Leg.room.service"
## [11] "Baggage.handling" "Inflight.service"
## [13] "Cleanliness" "Departure.Delay.in.Minutes"
## [15] "Arrival.Delay.in.Minutes" "satisfaction"
head(df)
## Inflight.wifi.service Departure.Arrival.time.convenient
## 1 5 4
## 2 1 1
## 3 2 0
## 4 0 0
## 5 2 3
## 6 3 3
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 4 3 4
## 2 3 1 5 4
## 3 2 4 2 2
## 4 0 2 3 4
## 5 4 3 4 1
## 6 3 3 5 5
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 3 5 5 5
## 2 5 4 4 4
## 3 2 2 4 1
## 4 4 1 1 1
## 5 2 2 2 2
## 6 3 5 4 3
## Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 1 5 5 5 50
## 2 4 4 5 0
## 3 3 2 2 0
## 4 1 1 4 0
## 5 2 2 4 0
## 6 1 2 5 0
## Arrival.Delay.in.Minutes satisfaction
## 1 44 satisfied
## 2 0 satisfied
## 3 0 neutral or dissatisfied
## 4 6 satisfied
## 5 20 satisfied
## 6 0 satisfied
# numerical and ranked variables
# without satisfaction variable (for PCA & FA)
df1 <- df[-16]
cat("Subset df1\nDimension:", dim(df1), "\n\n")
## Subset df1
## Dimension: 25893 15
colnames(df1)
## [1] "Inflight.wifi.service" "Departure.Arrival.time.convenient"
## [3] "Ease.of.Online.booking" "Gate.location"
## [5] "Food.and.drink" "Online.boarding"
## [7] "Seat.comfort" "Inflight.entertainment"
## [9] "On.board.service" "Leg.room.service"
## [11] "Baggage.handling" "Inflight.service"
## [13] "Cleanliness" "Departure.Delay.in.Minutes"
## [15] "Arrival.Delay.in.Minutes"
head(df1)
## Inflight.wifi.service Departure.Arrival.time.convenient
## 1 5 4
## 2 1 1
## 3 2 0
## 4 0 0
## 5 2 3
## 6 3 3
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 4 3 4
## 2 3 1 5 4
## 3 2 4 2 2
## 4 0 2 3 4
## 5 4 3 4 1
## 6 3 3 5 5
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 3 5 5 5
## 2 5 4 4 4
## 3 2 2 4 1
## 4 4 1 1 1
## 5 2 2 2 2
## 6 3 5 4 3
## Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 1 5 5 5 50
## 2 4 4 5 0
## 3 3 2 2 0
## 4 1 1 4 0
## 5 2 2 4 0
## 6 1 2 5 0
## Arrival.Delay.in.Minutes
## 1 44
## 2 0
## 3 0
## 4 6
## 5 20
## 6 0
# only ranked features
# without satisfaction variable (for CA)
df1_CA <- subset(df1, select = -c(4,7,11,14,15))
cat("Subset df1_CA\nDimension:", dim(df1_CA), "\n\n")
## Subset df1_CA
## Dimension: 25893 10
colnames(df1_CA)
## [1] "Inflight.wifi.service" "Departure.Arrival.time.convenient"
## [3] "Ease.of.Online.booking" "Food.and.drink"
## [5] "Online.boarding" "Inflight.entertainment"
## [7] "On.board.service" "Leg.room.service"
## [9] "Inflight.service" "Cleanliness"
head(df1_CA)
## Inflight.wifi.service Departure.Arrival.time.convenient
## 1 5 4
## 2 1 1
## 3 2 0
## 4 0 0
## 5 2 3
## 6 3 3
## Ease.of.Online.booking Food.and.drink Online.boarding Inflight.entertainment
## 1 3 3 4 5
## 2 3 5 4 4
## 3 2 2 2 2
## 4 0 3 4 1
## 5 4 4 1 2
## 6 3 5 5 5
## On.board.service Leg.room.service Inflight.service Cleanliness
## 1 5 5 5 5
## 2 4 4 4 5
## 3 4 1 2 2
## 4 1 1 1 4
## 5 2 2 2 4
## 6 4 3 2 5
# all variables in the original dataset except id (for logistic regression model)
all_df <- air[-1]
cat("Subset all_df\nDimension:", dim(all_df), "\n\n")
## Subset all_df
## Dimension: 25893 23
colnames(all_df)
## [1] "Gender" "Customer.Type"
## [3] "Age" "Type.of.Travel"
## [5] "Class" "Flight.Distance"
## [7] "Inflight.wifi.service" "Departure.Arrival.time.convenient"
## [9] "Ease.of.Online.booking" "Gate.location"
## [11] "Food.and.drink" "Online.boarding"
## [13] "Seat.comfort" "Inflight.entertainment"
## [15] "On.board.service" "Leg.room.service"
## [17] "Baggage.handling" "Checkin.service"
## [19] "Inflight.service" "Cleanliness"
## [21] "Departure.Delay.in.Minutes" "Arrival.Delay.in.Minutes"
## [23] "satisfaction"
head(all_df)
## Gender Customer.Type Age Type.of.Travel Class Flight.Distance
## 1 Female Loyal Customer 52 Business travel Eco 160
## 2 Female Loyal Customer 36 Business travel Business 2863
## 3 Male disloyal Customer 20 Business travel Eco 192
## 4 Male Loyal Customer 44 Business travel Business 3377
## 5 Female Loyal Customer 49 Business travel Eco 1182
## 6 Male Loyal Customer 16 Business travel Eco 311
## Inflight.wifi.service Departure.Arrival.time.convenient
## 1 5 4
## 2 1 1
## 3 2 0
## 4 0 0
## 5 2 3
## 6 3 3
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 4 3 4
## 2 3 1 5 4
## 3 2 4 2 2
## 4 0 2 3 4
## 5 4 3 4 1
## 6 3 3 5 5
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 3 5 5 5
## 2 5 4 4 4
## 3 2 2 4 1
## 4 4 1 1 1
## 5 2 2 2 2
## 6 3 5 4 3
## Baggage.handling Checkin.service Inflight.service Cleanliness
## 1 5 2 5 5
## 2 4 3 4 5
## 3 3 2 2 2
## 4 1 3 1 4
## 5 2 4 2 4
## 6 1 1 2 5
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
## 1 50 44 satisfied
## 2 0 0 satisfied
## 3 0 0 neutral or dissatisfied
## 4 0 6 satisfied
## 5 0 20 satisfied
## 6 0 0 satisfied
Following the literature review on the study of flooding in Metro Manila, Principal Component Analysis (PCA) can be used to identify which factor(s) make significant contributions to the model. PCA is commonly applied to interval data, where it helps to identify patterns and relationships among the variables and to reduce the data to a smaller set of meaningful principal components.
It’s important to consider the scaling of the data before performing the procedure, as PCA is sensitive to the scale of the input variables. As such, we will first standardize the data before applying PCA to ensure that each variable contributes equally to the analysis.
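The effect of scaling can be illustrated on synthetic data: without standardization, the variable with the largest variance dominates the first component. A sketch with made-up columns, not part of the airline data:

```r
set.seed(1)
X <- data.frame(small = rnorm(100, sd = 1),    # low-variance column
                large = rnorm(100, sd = 100))  # high-variance column

# unscaled: PC1 points almost entirely along the high-variance column
abs(prcomp(X)$rotation["large", "PC1"])        # close to 1

# scaled: both columns contribute comparably to PC1
abs(prcomp(X, scale. = TRUE)$rotation[, "PC1"])
```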
# perform pca on scaled data
airline.pca <- prcomp(df1, scale = TRUE) # only for numeric variables
# eigen vectors
airline.pca$rotation
## PC1 PC2 PC3
## Inflight.wifi.service -0.23925274 0.442446506 -0.072256918
## Departure.Arrival.time.convenient -0.09100587 0.423008024 -0.009790826
## Ease.of.Online.booking -0.16793110 0.525062498 -0.079347656
## Gate.location -0.06816985 0.426679040 -0.068023649
## Food.and.drink -0.32013129 -0.186270227 -0.324389216
## Online.boarding -0.28060823 0.151252291 -0.174564835
## Seat.comfort -0.35303712 -0.173392624 -0.301887713
## Inflight.entertainment -0.44276714 -0.183947275 -0.035481236
## On.board.service -0.27187035 -0.041004501 0.402184247
## Leg.room.service -0.22414466 0.009282316 0.314362300
## Baggage.handling -0.26107477 -0.028056947 0.444962063
## Inflight.service -0.26524868 -0.036398933 0.455456697
## Cleanliness -0.36588486 -0.199220406 -0.295681044
## Departure.Delay.in.Minutes 0.03810765 0.035453456 -0.031515902
## Arrival.Delay.in.Minutes 0.04159285 0.035480879 -0.035143410
## PC4 PC5 PC6
## Inflight.wifi.service 0.014255707 -0.2583079942 0.01615871
## Departure.Arrival.time.convenient 0.018576656 0.4304499504 -0.19652154
## Ease.of.Online.booking 0.020163115 -0.1716503762 0.05025981
## Gate.location 0.014384189 0.4957027522 0.19688093
## Food.and.drink -0.009095508 0.1922186236 0.06927214
## Online.boarding 0.009929605 -0.5922951978 -0.26613221
## Seat.comfort -0.009707081 0.0262176970 -0.02419582
## Inflight.entertainment -0.038343648 0.1388067685 0.04318400
## On.board.service -0.027666962 -0.0109842829 -0.24341472
## Leg.room.service -0.060367847 -0.1784212240 0.84056064
## Baggage.handling -0.052896100 0.0743482204 -0.18386908
## Inflight.service -0.004691003 0.0977359580 -0.19956432
## Cleanliness -0.024280614 0.1343504709 0.02902730
## Departure.Delay.in.Minutes -0.703556951 -0.0017109787 -0.02802289
## Arrival.Delay.in.Minutes -0.703051041 -0.0008505072 -0.02646164
## PC7 PC8 PC9
## Inflight.wifi.service -0.331814354 0.203126383 -0.172887764
## Departure.Arrival.time.convenient 0.616187157 0.435267124 0.007399416
## Ease.of.Online.booking -0.192705604 0.074744900 -0.089768862
## Gate.location -0.175084370 -0.630002080 0.146378703
## Food.and.drink -0.238111965 0.315086823 -0.227086287
## Online.boarding 0.258216460 -0.210124796 0.246312156
## Seat.comfort 0.299475384 -0.273020343 0.234586435
## Inflight.entertainment -0.165104081 0.010025613 -0.120499899
## On.board.service 0.213008874 -0.345405221 -0.694872721
## Leg.room.service 0.298555787 0.083790568 0.037393682
## Baggage.handling -0.182430786 0.119854856 0.475004088
## Inflight.service -0.183196318 0.080080703 0.214798661
## Cleanliness 0.026848489 0.040122396 0.052278611
## Departure.Delay.in.Minutes -0.002968948 -0.002197422 -0.002899163
## Arrival.Delay.in.Minutes -0.004984626 -0.001288131 -0.005081611
## PC10 PC11 PC12
## Inflight.wifi.service -0.0391357702 0.384060682 0.177449542
## Departure.Arrival.time.convenient 0.0148089349 -0.012165847 -0.025975003
## Ease.of.Online.booking 0.0030552308 -0.018017931 0.049808244
## Gate.location 0.0001411596 -0.177704492 -0.109646749
## Food.and.drink -0.0692629335 -0.657893067 0.153692996
## Online.boarding 0.0562326993 -0.389860762 -0.264886945
## Seat.comfort 0.0168528329 0.241766353 0.663287572
## Inflight.entertainment 0.0541198690 0.165538175 -0.057945723
## On.board.service -0.1614731331 -0.022311121 -0.004523106
## Leg.room.service 0.0155351694 -0.080645772 -0.024227512
## Baggage.handling -0.6417776514 -0.037900979 0.022180215
## Inflight.service 0.7400799534 -0.061252834 0.045749796
## Cleanliness -0.0068494931 0.370882894 -0.642616016
## Departure.Delay.in.Minutes 0.0260697207 -0.005101518 0.008508640
## Arrival.Delay.in.Minutes 0.0207106884 -0.002688373 0.004332571
## PC13 PC14 PC15
## Inflight.wifi.service 0.5305200744 0.177102775 -0.0001820452
## Departure.Arrival.time.convenient 0.0924398214 -0.098220519 -0.0002768573
## Ease.of.Online.booking -0.7755102724 -0.049020766 0.0011223177
## Gate.location 0.1753357304 0.025544328 -0.0003144714
## Food.and.drink 0.0657604889 0.206510285 0.0018226304
## Online.boarding 0.1824869919 -0.126697223 0.0004190597
## Seat.comfort -0.1176710431 0.144882554 0.0041637389
## Inflight.entertainment 0.0463037080 -0.817617553 -0.0068686624
## On.board.service -0.0406325198 0.163530509 0.0023330957
## Leg.room.service 0.0420493392 0.046602660 0.0014280155
## Baggage.handling -0.0391020525 0.048132253 0.0005464738
## Inflight.service -0.0239942538 0.176585085 0.0066054111
## Cleanliness -0.1367458514 0.381063375 0.0011106019
## Departure.Delay.in.Minutes -0.0008395439 0.010958723 -0.7068211337
## Arrival.Delay.in.Minutes 0.0020251768 -0.001120516 0.7073059859
# variance of PCs / eigen values
s <- (airline.pca$sdev)^2
# a table of PC number and cumulative variance
eigenval_table <- data.frame(
eigen_value = s
)
rownames(eigenval_table) <- paste0("PC", 1:length(s))
# print the table
print(eigenval_table)
## eigen_value
## PC1 3.70318105
## PC2 2.40761587
## PC3 2.15207400
## PC4 1.96140530
## PC5 1.03920635
## PC6 0.71103945
## PC7 0.57143431
## PC8 0.50776351
## PC9 0.47330385
## PC10 0.36360545
## PC11 0.33100581
## PC12 0.28709740
## PC13 0.25918968
## PC14 0.19664017
## PC15 0.03543781
There are 15 principal components (PCs), each a linear combination of all variables, capturing different portions of the total variance. To determine the optimal number of PCs to retain, we use 3 approaches: the scree plot, total variance explained, and the Kaiser criterion.
# Scree plot
fviz_eig(airline.pca) +
geom_vline(xintercept = 5, linetype = "dashed", col = 'red')
# total variance
# cumulative variance
rs <- s/sum(s)
# a table of PC number and cumulative variance
var_table <- data.frame(
cumulative_variance = cumsum(rs)
)
rownames(var_table) <- paste0("PC", 1:length(rs))
# print the table
print(var_table)
## cumulative_variance
## PC1 0.2468787
## PC2 0.4073865
## PC3 0.5508581
## PC4 0.6816184
## PC5 0.7508988
## PC6 0.7983015
## PC7 0.8363971
## PC8 0.8702480
## PC9 0.9018016
## PC10 0.9260419
## PC11 0.9481090
## PC12 0.9672488
## PC13 0.9845281
## PC14 0.9976375
## PC15 1.0000000
# Kaiser criterion
cat("Mean variance =", mean(s), "\n\n")
## Mean variance = 1
# print the table
print(eigenval_table)
## eigen_value
## PC1 3.70318105
## PC2 2.40761587
## PC3 2.15207400
## PC4 1.96140530
## PC5 1.03920635
## PC6 0.71103945
## PC7 0.57143431
## PC8 0.50776351
## PC9 0.47330385
## PC10 0.36360545
## PC11 0.33100581
## PC12 0.28709740
## PC13 0.25918968
## PC14 0.19664017
## PC15 0.03543781
The scree plot suggests retaining PC1 to PC5, since 5 points lie on the steep part of the curve. For total variance explained, 4 PCs capture around 68.2% of the total variance and 5 PCs capture around 75.1%; if we aimed to capture 80 - 90% of the total variance, at least 7 PCs would be required. Under the Kaiser criterion, the variances of PC1 to PC5 are greater than the average variance, so 5 PCs should be retained. Comparing these results, 5 PCs are selected for our PCA.
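The Kaiser rule amounts to a one-line check: for standardized data the mean eigenvalue is 1, so retain components whose eigenvalue exceeds 1. Using the first eigenvalues from the table above (rounded):

```r
# first seven eigenvalues from the table above, rounded
eig <- c(3.70, 2.41, 2.15, 1.96, 1.04, 0.71, 0.57)
sum(eig > 1)  # components retained under the Kaiser criterion: 5
```

Equivalently, `sum(airline.pca$sdev^2 > 1)` on the fitted PCA object gives the same count.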
Then, we obtain the loadings and check if they have some meaningful interpretation.
# obtain loadings for eigen vectors
cbind(airline.pca$sdev[1]*airline.pca$rotation[,1],
airline.pca$sdev[2]*airline.pca$rotation[,2],
airline.pca$sdev[3]*airline.pca$rotation[,3],
airline.pca$sdev[4]*airline.pca$rotation[,4],
airline.pca$sdev[5]*airline.pca$rotation[,5])
## [,1] [,2] [,3]
## Inflight.wifi.service -0.46040963 0.68652186 -0.10600053
## Departure.Arrival.time.convenient -0.17512852 0.65636015 -0.01436309
## Ease.of.Online.booking -0.32316075 0.81471291 -0.11640261
## Gate.location -0.13118367 0.66205628 -0.09979035
## Food.and.drink -0.61604949 -0.28902609 -0.47587732
## Online.boarding -0.53999268 0.23469053 -0.25608572
## Seat.comfort -0.67937232 -0.26904456 -0.44286773
## Inflight.entertainment -0.85204564 -0.28542168 -0.05205079
## On.board.service -0.52317782 -0.06362461 0.59000223
## Leg.room.service -0.43133617 0.01440290 0.46116789
## Baggage.handling -0.50240317 -0.04353454 0.65275707
## Inflight.service -0.51043530 -0.05647838 0.66815265
## Cleanliness -0.70409605 -0.30912022 -0.43376258
## Departure.Delay.in.Minutes 0.07333304 0.05501133 -0.04623367
## Arrival.Delay.in.Minutes 0.08003982 0.05505388 -0.05155520
## [,4] [,5]
## Inflight.wifi.service 0.019965144 -0.2633229687
## Departure.Arrival.time.convenient 0.026016640 0.4388070109
## Ease.of.Online.booking 0.028238479 -0.1749829183
## Gate.location 0.020145082 0.5053266769
## Food.and.drink -0.012738276 0.1959504923
## Online.boarding 0.013906430 -0.6037944367
## Seat.comfort -0.013594784 0.0267267060
## Inflight.entertainment -0.053700348 0.1415016615
## On.board.service -0.038747630 -0.0111975395
## Leg.room.service -0.084545279 -0.1818852202
## Baggage.handling -0.074081083 0.0757916695
## Inflight.service -0.006569759 0.0996334731
## Cleanliness -0.034005044 0.1369588462
## Departure.Delay.in.Minutes -0.985332778 -0.0017441968
## Arrival.Delay.in.Minutes -0.984624251 -0.0008670196
For example, in PC1, a highly correlated group would be Inflight.entertainment, Cleanliness and Seat.comfort, while for PC2, a highly correlated group would be Gate.location, Ease.of.Online.booking, Inflight.wifi.service and Departure.Arrival.time.convenient. To further enhance the interpretation of the original variables, varimax rotation is considered.
# scale df1
center_df1 <- scale(df1, center = TRUE, scale = TRUE)
# apply varimax rotation to scaled df1
fit <- principal(center_df1, nfactors = 5, rotate = "varimax")
fit$loadings
##
## Loadings:
## RC1 RC3 RC4 RC5 RC2
## Inflight.wifi.service 0.121 0.769 0.392
## Departure.Arrival.time.convenient 0.131 0.797
## Ease.of.Online.booking 0.730 0.525
## Gate.location 0.843
## Food.and.drink 0.852
## Online.boarding 0.305 0.811 -0.145
## Seat.comfort 0.836 0.167
## Inflight.entertainment 0.789 0.454
## On.board.service 0.784
## Leg.room.service 0.623 0.206
## Baggage.handling 0.827
## Inflight.service 0.841
## Cleanliness 0.889
## Departure.Delay.in.Minutes 0.990
## Arrival.Delay.in.Minutes 0.990
##
## RC1 RC3 RC4 RC5 RC2
## SS loadings 2.957 2.631 1.969 1.892 1.815
## Proportion Var 0.197 0.175 0.131 0.126 0.121
## Cumulative Var 0.197 0.373 0.504 0.630 0.751
The first 5 components capture around 75.1% of the variance after varimax rotation. Considering a cut-off value of 0.6:
Cleanliness, Food.and.drink, Inflight.entertainment and Seat.comfort are loaded on component 1, categorized as In-flight experience.
Departure.Arrival.time.convenient and Gate.location are loaded on component 2, categorized as Convenience.
Baggage.handling, Inflight.service, Leg.room.service and On.board.service are loaded on component 3, categorized as Service.
Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes are loaded on component 4, categorized as Punctuality.
Inflight.wifi.service, Ease.of.Online.booking and Online.boarding are loaded on component 5, categorized as Technology.
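The grouping above can be read directly off the rotated loading matrix by suppressing small values; the print method for loadings accepts `cutoff` and `sort` arguments (a quick sketch, assuming `fit` from the chunk above):

```r
# show only loadings at or above the 0.6 cut-off used in the interpretation,
# with variables sorted by the component they load on
print(fit$loadings, cutoff = 0.6, sort = TRUE)
```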
From the PCA results, we can also produce a 2-dimensional scatter plot for pairs of rotated components, with each point color-coded by the satisfaction response it represents. The following plot uses RC1 and RC4 to evaluate whether the components separate the response variable satisfaction. However, the two classes are not clearly separated. This could be because RC1 and RC4 together capture only about 33% of the total variance.
# visualization
pca_data <- as.data.frame(center_df1 %*% fit$loadings)
pca_data$satisfaction <- df$satisfaction
ggplot(pca_data, aes(x = RC1, y = RC4, color = satisfaction)) +
geom_point()
Producing the same 2-dimensional scatter plot for pairs of components and color-coding by other categorical variables, such as Gender and Class, gives similar results: no clear separation.
# visualization
pca_data2 <- as.data.frame(center_df1 %*% fit$loadings)
pca_data2$Class <- air$Class
ggplot(pca_data2, aes(x = RC1, y = RC4, color = Class)) +
geom_point()
# visualization
pca_data2 <- as.data.frame(center_df1 %*% fit$loadings)
pca_data2$Gender <- air$Gender
ggplot(pca_data2, aes(x = RC1, y = RC4, color = Gender)) +
geom_point()
Apart from principal component analysis, we could also use factor analysis to identify latent variables in the dataset.
Before conducting factor analysis, it is essential to first assess whether our dataset is appropriate for such analysis. This can be done using Bartlett’s test and KMO test.
Bartlett’s test is a statistical test used to determine whether the correlation matrix of a set of variables is suitable for factor analysis. It tests the null hypothesis that the correlation matrix is an identity matrix, indicating that the variables are uncorrelated and unsuitable for factor analysis. If the null hypothesis is rejected, it indicates that there is significant correlation among the variables, suggesting that factor analysis may be appropriate.
In practice, Bartlett’s test is often used in conjunction with the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy to assess the suitability of data for factor analysis. The KMO measure assesses the degree of common variance among the variables, while Bartlett’s test assesses the overall significance of the correlations. If both tests indicate that the data is suitable for factor analysis, then we can proceed with conducting the analysis.
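As a sanity check on what Bartlett's test measures, running it on uncorrelated noise should fail to reject the identity-matrix null (a sketch; the simulated data, seed, and sample size are purely illustrative):

```r
library(psych)
set.seed(1)
# 5 independent noise columns: their correlation matrix is near the identity
noise <- matrix(rnorm(1000 * 5), ncol = 5)
# expect a non-significant p-value, i.e. no evidence of factor structure
cortest.bartlett(cor(noise), n = 1000)
```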
Our Bartlett’s test produced a p-value close to 0, which is below the significance level of 0.05, so we reject the null hypothesis and conclude that factor analysis may be useful for our dataset.
At the same time, the KMO test gave an overall MSA of 0.74 (“middling” on Kaiser’s scale), indicating that the sampling adequacy is acceptable and the data are suitable for factor analysis.
# Bartlett's test
correlations = cor(df1)
cortest.bartlett(correlations, n = nrow(df1))
## $chisq
## [1] 215937.2
##
## $p.value
## [1] 0
##
## $df
## [1] 105
# KMO test
KMO(correlations)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = correlations)
## Overall MSA = 0.74
## MSA for each item =
## Inflight.wifi.service Departure.Arrival.time.convenient
## 0.74 0.76
## Ease.of.Online.booking Gate.location
## 0.69 0.70
## Food.and.drink Online.boarding
## 0.84 0.72
## Seat.comfort Inflight.entertainment
## 0.82 0.79
## On.board.service Leg.room.service
## 0.83 0.89
## Baggage.handling Inflight.service
## 0.81 0.78
## Cleanliness Departure.Delay.in.Minutes
## 0.83 0.50
## Arrival.Delay.in.Minutes
## 0.50
Similar to principal component analysis, we can determine the number of factors using parallel analysis, the Kaiser criterion and a scree plot.
Maximum likelihood estimation (MLE) is used to fit the factor model. MLE provides a flexible and powerful method for modeling the underlying factor structure of a set of variables and is well suited to a wide range of data types and assumptions.
# parallel analysis
nofactors = fa.parallel(df1, fm="ml", fa="fa")
## Parallel analysis suggests that the number of factors = 5 and the number of components = NA
# Kaiser criterion
# extract the number of eigenvalues
cat("Old criterion (> 1.0) - Number of factors:", sum(nofactors$fa.values > 1.0),"\n")
## Old criterion (> 1.0) - Number of factors: 4
cat("New criterion (> 0.7) - Number of factors:", sum(nofactors$fa.values > .7))
## New criterion (> 0.7) - Number of factors: 4
# Scree plot and parallel analysis
ev <- eigen(cor(df1)) # get eigenvalues
ap <- parallel(subject=nrow(df1),var=ncol(df1),
rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
From the analysis above, the suggested number of factors is between 4 and 5. Since we want to compare the results of PCA and FA, we will use the same number of factors as the number of principal components, which is 5.
# no rotation
df.fa <- fa(df1, nfactors = 5, rotate = "none", fm="ml")
# summary of the FA results
df.fa # some loadings appear blank because they fall below the print cut-off
## Factor Analysis using method = ml
## Call: fa(r = df1, nfactors = 5, rotate = "none", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML2 ML3 ML1 ML4 ML5 h2 u2 com
## Inflight.wifi.service 0.55 -0.28 -0.01 0.45 0.22 0.64 0.363 2.8
## Departure.Arrival.time.convenient 0.15 -0.21 0.00 0.41 0.23 0.29 0.713 2.4
## Ease.of.Online.booking 0.48 -0.45 0.00 0.53 0.31 0.82 0.184 3.6
## Gate.location 0.09 -0.19 0.01 0.44 0.35 0.36 0.641 2.4
## Food.and.drink 0.43 0.52 -0.03 -0.12 0.38 0.62 0.384 2.9
## Online.boarding 0.91 -0.25 -0.03 -0.18 -0.13 0.94 0.060 1.3
## Seat.comfort 0.57 0.46 -0.04 -0.21 0.23 0.63 0.369 2.6
## Inflight.entertainment 0.55 0.72 -0.03 0.15 0.07 0.84 0.159 2.0
## On.board.service 0.26 0.32 -0.03 0.37 -0.44 0.50 0.497 3.5
## Leg.room.service 0.22 0.20 0.01 0.32 -0.26 0.26 0.744 3.5
## Baggage.handling 0.21 0.34 -0.01 0.45 -0.47 0.58 0.418 3.2
## Inflight.service 0.20 0.36 -0.06 0.48 -0.49 0.65 0.353 3.2
## Cleanliness 0.53 0.58 -0.02 -0.15 0.31 0.73 0.270 2.7
## Departure.Delay.in.Minutes 0.01 0.00 0.97 0.00 -0.01 0.93 0.065 1.0
## Arrival.Delay.in.Minutes 0.00 0.00 1.00 0.00 0.00 1.00 0.005 1.0
##
## ML2 ML3 ML1 ML4 ML5
## SS loadings 2.68 2.14 1.94 1.66 1.35
## Proportion Var 0.18 0.14 0.13 0.11 0.09
## Cumulative Var 0.18 0.32 0.45 0.56 0.65
## Proportion Explained 0.27 0.22 0.20 0.17 0.14
## Cumulative Proportion 0.27 0.49 0.69 0.86 1.00
##
## Mean item complexity = 2.6
## Test of the hypothesis that 5 factors are sufficient.
##
## df null model = 105 with the objective function = 8.34 with Chi Square = 215937.2
## df of the model are 40 and the objective function was 0.14
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic n.obs is 25893 with the empirical chi square 2190.43 with prob < 0
## The total n.obs was 25893 with Likelihood Chi Square = 3581.96 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.957
## RMSEA index = 0.058 and the 90 % confidence intervals are 0.057 0.06
## BIC = 3175.49
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## ML2 ML3 ML1 ML4 ML5
## Correlation of (regression) scores with factors 0.98 0.95 1.00 0.92 0.89
## Multiple R square of scores with factors 0.95 0.90 1.00 0.84 0.79
## Minimum correlation of possible factor scores 0.91 0.80 0.99 0.68 0.58
# diagram
fa.diagram(df.fa)
Without any rotation, some groups can still be categorized easily, e.g. ML1 for punctuality and ML4 for convenience, while other groups are less apparent. Let’s apply rotation, which redistributes the variance across factors to improve interpretability.
# varimax rotation
df.fa.varimax <- fa(df1, nfactors = 5, rotate = "varimax", fm="ml")
# summary of the FA results
df.fa.varimax # some loadings appear blank because they fall below the print cut-off
## Factor Analysis using method = ml
## Call: fa(r = df1, nfactors = 5, rotate = "varimax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML3 ML5 ML1 ML4 ML2 h2 u2 com
## Inflight.wifi.service 0.12 0.13 -0.01 0.71 0.31 0.64 0.363 1.5
## Departure.Arrival.time.convenient -0.04 0.04 0.00 0.53 -0.02 0.29 0.713 1.0
## Ease.of.Online.booking -0.01 0.04 0.00 0.86 0.27 0.82 0.184 1.2
## Gate.location 0.00 -0.03 0.01 0.59 -0.11 0.36 0.641 1.1
## Food.and.drink 0.78 0.00 -0.01 0.03 0.01 0.62 0.384 1.0
## Online.boarding 0.28 0.09 -0.01 0.19 0.90 0.94 0.060 1.3
## Seat.comfort 0.76 0.05 -0.02 -0.03 0.22 0.63 0.369 1.2
## Inflight.entertainment 0.79 0.47 -0.01 0.04 0.01 0.84 0.159 1.6
## On.board.service 0.09 0.70 -0.02 0.00 0.07 0.50 0.497 1.1
## Leg.room.service 0.08 0.49 0.02 0.09 0.05 0.26 0.744 1.1
## Baggage.handling 0.05 0.76 0.01 0.02 0.00 0.58 0.418 1.0
## Inflight.service 0.05 0.80 -0.05 0.02 -0.02 0.65 0.353 1.0
## Cleanliness 0.85 0.07 0.00 -0.01 0.09 0.73 0.270 1.0
## Departure.Delay.in.Minutes -0.02 -0.01 0.97 0.00 -0.01 0.93 0.065 1.0
## Arrival.Delay.in.Minutes -0.02 -0.02 1.00 0.00 -0.01 1.00 0.005 1.0
##
## ML3 ML5 ML1 ML4 ML2
## SS loadings 2.64 2.21 1.93 1.93 1.06
## Proportion Var 0.18 0.15 0.13 0.13 0.07
## Cumulative Var 0.18 0.32 0.45 0.58 0.65
## Proportion Explained 0.27 0.23 0.20 0.20 0.11
## Cumulative Proportion 0.27 0.50 0.69 0.89 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 5 factors are sufficient.
##
## df null model = 105 with the objective function = 8.34 with Chi Square = 215937.2
## df of the model are 40 and the objective function was 0.14
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic n.obs is 25893 with the empirical chi square 2190.43 with prob < 0
## The total n.obs was 25893 with Likelihood Chi Square = 3581.96 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.957
## RMSEA index = 0.058 and the 90 % confidence intervals are 0.057 0.06
## BIC = 3175.49
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## ML3 ML5 ML1 ML4 ML2
## Correlation of (regression) scores with factors 0.94 0.91 1.00 0.92 0.95
## Multiple R square of scores with factors 0.89 0.84 1.00 0.85 0.91
## Minimum correlation of possible factor scores 0.78 0.67 0.99 0.70 0.82
# diagram
fa.diagram(df.fa.varimax)
# oblimin rotation
df.fa.oblimin <- fa(df1, nfactors = 5, rotate = "oblimin", fm="ml")
## Loading required namespace: GPArotation
# summary of the FA results
df.fa.oblimin # some loadings appear blank because they fall below the print cut-off
## Factor Analysis using method = ml
## Call: fa(r = df1, nfactors = 5, rotate = "oblimin", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML3 ML5 ML4 ML1 ML2 h2 u2 com
## Inflight.wifi.service 0.06 0.07 0.70 0.00 0.17 0.64 0.363 1.1
## Departure.Arrival.time.convenient -0.01 0.02 0.56 0.00 -0.14 0.29 0.713 1.1
## Ease.of.Online.booking -0.05 -0.02 0.87 0.00 0.10 0.82 0.184 1.0
## Gate.location 0.06 -0.06 0.64 0.00 -0.26 0.36 0.641 1.4
## Food.and.drink 0.82 -0.13 0.04 -0.01 -0.06 0.62 0.384 1.1
## Online.boarding 0.04 0.01 0.06 0.00 0.93 0.94 0.060 1.0
## Seat.comfort 0.73 -0.07 -0.06 -0.01 0.20 0.63 0.369 1.2
## Inflight.entertainment 0.78 0.35 0.03 0.01 -0.05 0.84 0.159 1.4
## On.board.service 0.01 0.70 -0.03 -0.01 0.07 0.50 0.497 1.0
## Leg.room.service 0.03 0.48 0.07 0.03 0.03 0.26 0.744 1.1
## Baggage.handling -0.02 0.77 0.00 0.02 0.00 0.58 0.418 1.0
## Inflight.service -0.01 0.81 0.00 -0.03 -0.02 0.65 0.353 1.0
## Cleanliness 0.85 -0.06 -0.02 0.00 0.04 0.73 0.270 1.0
## Departure.Delay.in.Minutes 0.00 0.00 0.00 0.97 0.00 0.93 0.065 1.0
## Arrival.Delay.in.Minutes 0.00 0.00 0.00 1.00 0.00 1.00 0.005 1.0
##
## ML3 ML5 ML4 ML1 ML2
## SS loadings 2.61 2.14 2.00 1.93 1.10
## Proportion Var 0.17 0.14 0.13 0.13 0.07
## Cumulative Var 0.17 0.32 0.45 0.58 0.65
## Proportion Explained 0.27 0.22 0.20 0.20 0.11
## Cumulative Proportion 0.27 0.49 0.69 0.89 1.00
##
## With factor correlations of
## ML3 ML5 ML4 ML1 ML2
## ML3 1.00 0.23 0.05 -0.03 0.32
## ML5 0.23 1.00 0.09 -0.03 0.10
## ML4 0.05 0.09 1.00 0.00 0.34
## ML1 -0.03 -0.03 0.00 1.00 -0.02
## ML2 0.32 0.10 0.34 -0.02 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 5 factors are sufficient.
##
## df null model = 105 with the objective function = 8.34 with Chi Square = 215937.2
## df of the model are 40 and the objective function was 0.14
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic n.obs is 25893 with the empirical chi square 2190.43 with prob < 0
## The total n.obs was 25893 with Likelihood Chi Square = 3581.96 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.957
## RMSEA index = 0.058 and the 90 % confidence intervals are 0.057 0.06
## BIC = 3175.49
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## ML3 ML5 ML4 ML1 ML2
## Correlation of (regression) scores with factors 0.95 0.92 0.93 1.00 0.97
## Multiple R square of scores with factors 0.91 0.84 0.87 1.00 0.94
## Minimum correlation of possible factor scores 0.82 0.69 0.74 0.99 0.88
# diagram
fa.diagram(df.fa.oblimin)
Varimax rotation produced clearer categorization for the variables. Specifically:
In-flight experience: Cleanliness, Seat.comfort, Food.and.drink, Inflight.entertainment
Convenience: Ease.of.Online.booking, Inflight.wifi.service, Gate.location, Departure.Arrival.time.convenient
Service: Inflight.service, Baggage.handling, On.board.service, Leg.room.service
Punctuality: Departure.Delay.in.Minutes, Arrival.Delay.in.Minutes
Other: Online.boarding
It’s no surprise that the categorization is similar to PCA, and some groupings are exactly the same. In terms of grouping, varimax rotation gave the most distinct result for this dataset.
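The similarity between the PCA and FA solutions can be quantified with Tucker's congruence coefficient; `psych::factor.congruence` compares two loading matrices column by column, with values near 1 indicating matching components and factors (a sketch, assuming `fit` and `df.fa.varimax` from the chunks above):

```r
# congruence between varimax-rotated PCA components and ML factors;
# each cell compares one rotated component with one factor
factor.congruence(fit$loadings, df.fa.varimax$loadings)
```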
To maximize the separation between groups, we will conduct Linear Discriminant Analysis (LDA).
# split the dataset into train set and test set (80/20)
set.seed(100)
sample <- sample.int(n = nrow(df), size = floor(.8*nrow(df)), replace = F) # 80% of dataset
train <- df[sample, ]
test <- df[-sample, ]
# LDA on train set using all the variables
lda.sa <- lda(satisfaction~. , data=train)
lda.sa
## Call:
## lda(satisfaction ~ ., data = train)
##
## Prior probabilities of groups:
## neutral or dissatisfied satisfied
## 0.5597663 0.4402337
##
## Group means:
## Inflight.wifi.service Departure.Arrival.time.convenient
## neutral or dissatisfied 2.393618 3.133161
## satisfied 3.152319 2.942099
## Ease.of.Online.booking Gate.location Food.and.drink
## neutral or dissatisfied 2.560414 3.003967 2.965502
## satisfied 3.019191 2.962057 3.553679
## Online.boarding Seat.comfort Inflight.entertainment
## neutral or dissatisfied 2.669340 3.049418 2.889263
## satisfied 4.021713 3.957780 3.957451
## On.board.service Leg.room.service Baggage.handling
## neutral or dissatisfied 3.018284 2.988098 3.371712
## satisfied 3.852615 3.808751 3.972585
## Inflight.service Cleanliness Departure.Delay.in.Minutes
## neutral or dissatisfied 3.391807 2.920828 15.94756
## satisfied 3.978397 3.758855 12.19048
## Arrival.Delay.in.Minutes
## neutral or dissatisfied 16.78456
## satisfied 12.31407
##
## Coefficients of linear discriminants:
## LD1
## Inflight.wifi.service 0.167307082
## Departure.Arrival.time.convenient -0.174008029
## Ease.of.Online.booking -0.085913703
## Gate.location 0.055907376
## Food.and.drink -0.037248138
## Online.boarding 0.568857329
## Seat.comfort 0.092439547
## Inflight.entertainment 0.128792389
## On.board.service 0.207621464
## Leg.room.service 0.230826043
## Baggage.handling 0.075863299
## Inflight.service 0.026366805
## Cleanliness 0.096395010
## Departure.Delay.in.Minutes 0.003038314
## Arrival.Delay.in.Minutes -0.005180921
# prediction on test set
sa.pred <- predict(lda.sa, newdata = test)
# accuracy and classification report
summary(sa.pred$class)
## neutral or dissatisfied satisfied
## 2913 2266
xtab <- table(sa.pred$class, test$satisfaction)
# confusion matrix
caret::confusionMatrix(xtab, positive = "satisfied")
## Registered S3 methods overwritten by 'proxy':
## method from
## print.registry_field registry
## print.registry_entry registry
## Confusion Matrix and Statistics
##
##
## neutral or dissatisfied satisfied
## neutral or dissatisfied 2440 473
## satisfied 493 1773
##
## Accuracy : 0.8135
## 95% CI : (0.8026, 0.824)
## No Information Rate : 0.5663
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6207
##
## Mcnemar's Test P-Value : 0.541
##
## Sensitivity : 0.7894
## Specificity : 0.8319
## Pos Pred Value : 0.7824
## Neg Pred Value : 0.8376
## Prevalence : 0.4337
## Detection Rate : 0.3423
## Detection Prevalence : 0.4375
## Balanced Accuracy : 0.8107
##
## 'Positive' Class : satisfied
##
# store probabilities assigned to cases
pb <- sa.pred$posterior
# create data frame from probabilities
pb <- as.data.frame(pb)
# combine probabilities and cases in data frame
pred.LDA <- data.frame(test$satisfaction,pb$satisfied)
# change column names
colnames(pred.LDA) <- c("target","score")
# transform the target to 1 and 0
labels <- as.factor(ifelse(pred.LDA$target=="satisfied", 1, 0))
predictions <- pred.LDA$score
# ROC curve
plot(roc(predictions, labels), min=0, max=1, type="l", main="LDA - ROC Chart")
text(0.5, 0.8, paste0("LDA AUC = ", round(auc(roc(predictions, labels)), 5)), col = "#3F7ED5", cex = 1.0)
The AUC value of 0.87 (> 0.8) and an accuracy of 81.35% indicate that the LDA model performed relatively well in predicting the satisfaction level.
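These headline figures can be recomputed directly from the confusion matrix (a sketch, assuming `xtab` from above, whose rows are predictions and columns are the true labels):

```r
# overall accuracy: correct predictions over all test cases
accuracy <- sum(diag(xtab)) / sum(xtab)        # (2440 + 1773) / 5179, about 0.8135
# sensitivity: true "satisfied" recovered among all actually satisfied
sensitivity <- xtab["satisfied", "satisfied"] /
  sum(xtab[, "satisfied"])                     # 1773 / (473 + 1773), about 0.7894
# specificity: true negatives among all actually neutral or dissatisfied
specificity <- xtab["neutral or dissatisfied", "neutral or dissatisfied"] /
  sum(xtab[, "neutral or dissatisfied"])       # 2440 / (2440 + 493), about 0.8319
```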
# Add predictions as a new column to the dataset for further analysis
df_class <- df
lda.full <- lda(satisfaction~. , data=df)
pred.full <- predict(lda.full)
df_class$lda_class <- pred.full$class
head(df_class)
## Inflight.wifi.service Departure.Arrival.time.convenient
## 1 5 4
## 2 1 1
## 3 2 0
## 4 0 0
## 5 2 3
## 6 3 3
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 4 3 4
## 2 3 1 5 4
## 3 2 4 2 2
## 4 0 2 3 4
## 5 4 3 4 1
## 6 3 3 5 5
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 3 5 5 5
## 2 5 4 4 4
## 3 2 2 4 1
## 4 4 1 1 1
## 5 2 2 2 2
## 6 3 5 4 3
## Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 1 5 5 5 50
## 2 4 4 5 0
## 3 3 2 2 0
## 4 1 1 4 0
## 5 2 2 4 0
## 6 1 2 5 0
## Arrival.Delay.in.Minutes satisfaction lda_class
## 1 44 satisfied satisfied
## 2 0 satisfied satisfied
## 3 0 neutral or dissatisfied neutral or dissatisfied
## 4 6 satisfied neutral or dissatisfied
## 5 20 satisfied neutral or dissatisfied
## 6 0 satisfied satisfied
Before performing forward selection, backward elimination and stepwise regression on the same train and test data used in the LDA section, we first need to convert our response variable, satisfaction, into a numerical value.
We create two new data frames (train_re and test_re) with all 15 predictors and the response variable numerically encoded (satisfied = 1, otherwise 0).
train_re <- train
test_re <- test
# convert the response variable into numerical data
train_re$satisfaction <- ifelse(train_re$satisfaction=="satisfied", 1, 0)
test_re$satisfaction <- ifelse(test_re$satisfaction=="satisfied", 1, 0)
head(train_re)
## Inflight.wifi.service Departure.Arrival.time.convenient
## 20170 2 1
## 16887 1 1
## 3430 4 0
## 3696 2 2
## 20474 3 5
## 24270 1 1
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 20170 2 3 5 2
## 16887 2 1 5 5
## 3430 4 1 3 4
## 3696 2 4 4 2
## 20474 3 4 3 3
## 24270 1 1 3 4
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 20170 5 5 3 5
## 16887 5 5 5 2
## 3430 3 3 4 1
## 3696 4 4 4 5
## 20474 4 2 2 3
## 24270 5 4 4 4
## Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 20170 4 3 5 22
## 16887 5 4 5 0
## 3430 2 1 3 21
## 3696 4 3 4 0
## 20474 2 2 3 2
## 24270 4 4 5 0
## Arrival.Delay.in.Minutes satisfaction
## 20170 7 0
## 16887 47 1
## 3430 27 1
## 3696 12 0
## 20474 4 0
## 24270 0 1
head(test_re)
## Inflight.wifi.service Departure.Arrival.time.convenient
## 5 2 3
## 6 3 3
## 9 5 2
## 12 2 5
## 13 5 5
## 17 2 5
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 5 4 3 4 1
## 6 3 3 5 5
## 9 2 2 5 5
## 12 5 5 1 3
## 13 5 5 4 5
## 17 5 5 2 2
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 5 2 2 2 2
## 6 3 5 4 3
## 9 5 5 2 2
## 12 4 2 2 2
## 13 5 5 5 5
## 17 2 2 4 3
## Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 5 2 2 4 0
## 6 1 2 5 0
## 9 5 3 5 1
## 12 2 2 4 18
## 13 5 5 3 0
## 17 3 3 2 2
## Arrival.Delay.in.Minutes satisfaction
## 5 20 1
## 6 0 1
## 9 0 1
## 12 7 0
## 13 0 1
## 17 0 0
We also need to define an intercept-only model and a model with all predictors included, using the train_re data.
# define intercept-only model
intercept_only <- glm(satisfaction ~ 1, data = train_re, family = binomial(link = "logit"))
# define model with all predictors
all <- glm(satisfaction ~ ., data = train_re, family = binomial(link = "logit"))
# perform forward stepwise regression
forward <- step(intercept_only, direction='forward', scope=formula(all), trace=0)
# view results of forward stepwise regression
summary(forward)
##
## Call:
## glm(formula = satisfaction ~ Online.boarding + Inflight.entertainment +
## Leg.room.service + On.board.service + Departure.Arrival.time.convenient +
## Inflight.wifi.service + Seat.comfort + Arrival.Delay.in.Minutes +
## Ease.of.Online.booking + Cleanliness + Baggage.handling +
## Gate.location + Departure.Delay.in.Minutes + Inflight.service +
## Food.and.drink, family = binomial(link = "logit"), data = train_re)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6668 -0.7052 -0.2557 0.6598 3.5119
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.124802 0.123413 -57.731 < 2e-16 ***
## Online.boarding 0.839480 0.019052 44.062 < 2e-16 ***
## Inflight.entertainment 0.159662 0.025462 6.271 3.59e-10 ***
## Leg.room.service 0.356670 0.015826 22.537 < 2e-16 ***
## On.board.service 0.322724 0.018974 17.009 < 2e-16 ***
## Departure.Arrival.time.convenient -0.277757 0.014535 -19.109 < 2e-16 ***
## Inflight.wifi.service 0.324612 0.022605 14.360 < 2e-16 ***
## Seat.comfort 0.137125 0.020757 6.606 3.95e-11 ***
## Arrival.Delay.in.Minutes -0.007914 0.001887 -4.194 2.74e-05 ***
## Ease.of.Online.booking -0.182070 0.022882 -7.957 1.76e-15 ***
## Cleanliness 0.155328 0.022550 6.888 5.66e-12 ***
## Baggage.handling 0.104356 0.021108 4.944 7.66e-07 ***
## Gate.location 0.080271 0.017263 4.650 3.32e-06 ***
## Departure.Delay.in.Minutes 0.004065 0.001900 2.139 0.0324 *
## Inflight.service 0.038413 0.022109 1.737 0.0823 .
## Food.and.drink -0.030381 0.020221 -1.502 0.1330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28419 on 20713 degrees of freedom
## Residual deviance: 18621 on 20698 degrees of freedom
## AIC: 18653
##
## Number of Fisher Scoring iterations: 5
# anova
forward$anova
## Step Df Deviance Resid. Df Resid. Dev
## 1 NA NA 20713 28419.03
## 2 + Online.boarding -1 5817.785334 20712 22601.24
## 3 + Inflight.entertainment -1 1977.286991 20711 20623.96
## 4 + Leg.room.service -1 919.615584 20710 19704.34
## 5 + On.board.service -1 330.017828 20709 19374.32
## 6 + Departure.Arrival.time.convenient -1 313.960049 20708 19060.36
## 7 + Inflight.wifi.service -1 132.350977 20707 18928.01
## 8 + Seat.comfort -1 88.837003 20706 18839.18
## 9 + Arrival.Delay.in.Minutes -1 56.131794 20705 18783.04
## 10 + Ease.of.Online.booking -1 54.482287 20704 18728.56
## 11 + Cleanliness -1 38.182605 20703 18690.38
## 12 + Baggage.handling -1 37.482899 20702 18652.90
## 13 + Gate.location -1 22.024474 20701 18630.87
## 14 + Departure.Delay.in.Minutes -1 4.706415 20700 18626.17
## 15 + Inflight.service -1 3.338926 20699 18622.83
## 16 + Food.and.drink -1 2.262173 20698 18620.56
## AIC
## 1 28421.03
## 2 22605.24
## 3 20629.96
## 4 19712.34
## 5 19384.32
## 6 19072.36
## 7 18942.01
## 8 18855.18
## 9 18801.04
## 10 18748.56
## 11 18712.38
## 12 18676.90
## 13 18656.87
## 14 18654.17
## 15 18652.83
## 16 18652.56
From the result, we could drop the variables entered last, since they reduce the AIC only slightly compared with the earlier additions: Gate.location, Departure.Delay.in.Minutes, Inflight.service and Food.and.drink.
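The step-by-step AIC improvements can be extracted from the anova table to see which additions barely help (a sketch, assuming `forward` from above):

```r
# AIC improvement contributed by each variable entered during forward selection;
# small values flag variables that are candidates for dropping
aic_path <- forward$anova$AIC
data.frame(step = forward$anova$Step[-1],
           aic_drop = -diff(aic_path))
```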
# perform backward stepwise regression
backward <- step(all, direction='backward', scope=formula(all), trace=0)
# view results of backward stepwise regression
summary(backward)
##
## Call:
## glm(formula = satisfaction ~ Inflight.wifi.service + Departure.Arrival.time.convenient +
## Ease.of.Online.booking + Gate.location + Food.and.drink +
## Online.boarding + Seat.comfort + Inflight.entertainment +
## On.board.service + Leg.room.service + Baggage.handling +
## Inflight.service + Cleanliness + Departure.Delay.in.Minutes +
## Arrival.Delay.in.Minutes, family = binomial(link = "logit"),
## data = train_re)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6668 -0.7052 -0.2557 0.6598 3.5119
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.124802 0.123413 -57.731 < 2e-16 ***
## Inflight.wifi.service 0.324612 0.022605 14.360 < 2e-16 ***
## Departure.Arrival.time.convenient -0.277757 0.014535 -19.109 < 2e-16 ***
## Ease.of.Online.booking -0.182070 0.022882 -7.957 1.76e-15 ***
## Gate.location 0.080271 0.017263 4.650 3.32e-06 ***
## Food.and.drink -0.030381 0.020221 -1.502 0.1330
## Online.boarding 0.839480 0.019052 44.062 < 2e-16 ***
## Seat.comfort 0.137125 0.020757 6.606 3.95e-11 ***
## Inflight.entertainment 0.159662 0.025462 6.271 3.59e-10 ***
## On.board.service 0.322724 0.018974 17.009 < 2e-16 ***
## Leg.room.service 0.356670 0.015826 22.537 < 2e-16 ***
## Baggage.handling 0.104356 0.021108 4.944 7.66e-07 ***
## Inflight.service 0.038413 0.022109 1.737 0.0823 .
## Cleanliness 0.155328 0.022550 6.888 5.66e-12 ***
## Departure.Delay.in.Minutes 0.004065 0.001900 2.139 0.0324 *
## Arrival.Delay.in.Minutes -0.007914 0.001887 -4.194 2.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28419 on 20713 degrees of freedom
## Residual deviance: 18621 on 20698 degrees of freedom
## AIC: 18653
##
## Number of Fisher Scoring iterations: 5
# anova
backward$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 20698 18620.56 18652.56
From the result, backward elimination does not drop any variables.
# perform stepwise regression
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)
# view results of stepwise regression
summary(both)
##
## Call:
## glm(formula = satisfaction ~ Online.boarding + Inflight.entertainment +
## Leg.room.service + On.board.service + Departure.Arrival.time.convenient +
## Inflight.wifi.service + Seat.comfort + Arrival.Delay.in.Minutes +
## Ease.of.Online.booking + Cleanliness + Baggage.handling +
## Gate.location + Departure.Delay.in.Minutes + Inflight.service +
## Food.and.drink, family = binomial(link = "logit"), data = train_re)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6668 -0.7052 -0.2557 0.6598 3.5119
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.124802 0.123413 -57.731 < 2e-16 ***
## Online.boarding 0.839480 0.019052 44.062 < 2e-16 ***
## Inflight.entertainment 0.159662 0.025462 6.271 3.59e-10 ***
## Leg.room.service 0.356670 0.015826 22.537 < 2e-16 ***
## On.board.service 0.322724 0.018974 17.009 < 2e-16 ***
## Departure.Arrival.time.convenient -0.277757 0.014535 -19.109 < 2e-16 ***
## Inflight.wifi.service 0.324612 0.022605 14.360 < 2e-16 ***
## Seat.comfort 0.137125 0.020757 6.606 3.95e-11 ***
## Arrival.Delay.in.Minutes -0.007914 0.001887 -4.194 2.74e-05 ***
## Ease.of.Online.booking -0.182070 0.022882 -7.957 1.76e-15 ***
## Cleanliness 0.155328 0.022550 6.888 5.66e-12 ***
## Baggage.handling 0.104356 0.021108 4.944 7.66e-07 ***
## Gate.location 0.080271 0.017263 4.650 3.32e-06 ***
## Departure.Delay.in.Minutes 0.004065 0.001900 2.139 0.0324 *
## Inflight.service 0.038413 0.022109 1.737 0.0823 .
## Food.and.drink -0.030381 0.020221 -1.502 0.1330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28419 on 20713 degrees of freedom
## Residual deviance: 18621 on 20698 degrees of freedom
## AIC: 18653
##
## Number of Fisher Scoring iterations: 5
# anova
both$anova
## Step Df Deviance Resid. Df Resid. Dev
## 1 NA NA 20713 28419.03
## 2 + Online.boarding -1 5817.785334 20712 22601.24
## 3 + Inflight.entertainment -1 1977.286991 20711 20623.96
## 4 + Leg.room.service -1 919.615584 20710 19704.34
## 5 + On.board.service -1 330.017828 20709 19374.32
## 6 + Departure.Arrival.time.convenient -1 313.960049 20708 19060.36
## 7 + Inflight.wifi.service -1 132.350977 20707 18928.01
## 8 + Seat.comfort -1 88.837003 20706 18839.18
## 9 + Arrival.Delay.in.Minutes -1 56.131794 20705 18783.04
## 10 + Ease.of.Online.booking -1 54.482287 20704 18728.56
## 11 + Cleanliness -1 38.182605 20703 18690.38
## 12 + Baggage.handling -1 37.482899 20702 18652.90
## 13 + Gate.location -1 22.024474 20701 18630.87
## 14 + Departure.Delay.in.Minutes -1 4.706415 20700 18626.17
## 15 + Inflight.service -1 3.338926 20699 18622.83
## 16 + Food.and.drink -1 2.262173 20698 18620.56
## AIC
## 1 28421.03
## 2 22605.24
## 3 20629.96
## 4 19712.34
## 5 19384.32
## 6 19072.36
## 7 18942.01
## 8 18855.18
## 9 18801.04
## 10 18748.56
## 11 18712.38
## 12 18676.90
## 13 18656.87
## 14 18654.17
## 15 18652.83
## 16 18652.56
From the result, we could drop the variables entered last, since they reduce the AIC only slightly compared with the earlier additions: Gate.location, Departure.Delay.in.Minutes, Inflight.service and Food.and.drink.
# Create the final model dropping the above 4 variables:
model_final <- glm(formula = satisfaction ~ Inflight.wifi.service + Departure.Arrival.time.convenient +
Ease.of.Online.booking + Online.boarding + Seat.comfort + Inflight.entertainment +
On.board.service + Leg.room.service + Baggage.handling + Cleanliness +
Arrival.Delay.in.Minutes, family = binomial(logit), data = train_re)
summary(model_final)
##
## Call:
## glm(formula = satisfaction ~ Inflight.wifi.service + Departure.Arrival.time.convenient +
## Ease.of.Online.booking + Online.boarding + Seat.comfort +
## Inflight.entertainment + On.board.service + Leg.room.service +
## Baggage.handling + Cleanliness + Arrival.Delay.in.Minutes,
## family = binomial(logit), data = train_re)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6264 -0.7030 -0.2548 0.6603 3.4835
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.9804384 0.1124365 -62.083 < 2e-16 ***
## Inflight.wifi.service 0.3316896 0.0226146 14.667 < 2e-16 ***
## Departure.Arrival.time.convenient -0.2575741 0.0138875 -18.547 < 2e-16 ***
## Ease.of.Online.booking -0.1567413 0.0222052 -7.059 1.68e-12 ***
## Online.boarding 0.8218477 0.0185445 44.318 < 2e-16 ***
## Seat.comfort 0.1363618 0.0204004 6.684 2.32e-11 ***
## Inflight.entertainment 0.1582016 0.0230386 6.867 6.56e-12 ***
## On.board.service 0.3315938 0.0182477 18.172 < 2e-16 ***
## Leg.room.service 0.3589314 0.0157224 22.829 < 2e-16 ***
## Baggage.handling 0.1188003 0.0194095 6.121 9.32e-10 ***
## Cleanliness 0.1425549 0.0216627 6.581 4.68e-11 ***
## Arrival.Delay.in.Minutes -0.0040447 0.0005157 -7.844 4.38e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28419 on 20713 degrees of freedom
## Residual deviance: 18653 on 20702 degrees of freedom
## AIC: 18677
##
## Number of Fisher Scoring iterations: 5
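Since the logistic coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (a sketch, assuming `model_final` from above):

```r
# odds ratio per one-point increase in each rating, holding the others fixed;
# e.g. exp(0.8218) is roughly 2.27 for Online.boarding, so each extra rating
# point multiplies the odds of satisfaction by about 2.27
round(exp(coef(model_final)), 3)
```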
We will now use model_final to predict on the test set and evaluate its performance.
# prediction using test_re data (predict() returns values on the link / log-odds scale)
test_re$fitted <- predict(model_final, test_re)
# calculate the predicted probabilities via the inverse logit
test_re$prob <- exp(test_re$fitted)/(1+exp(test_re$fitted))
# confusion matrix and statistics, using the positive-class prevalence as the cut-off
prob_cut_off <- sum(test_re$satisfaction)/nrow(test_re)
test_re$predict <- as.numeric(test_re$prob > prob_cut_off)
xtab_num <- table(test_re$predict, test_re$satisfaction)
caret::confusionMatrix(xtab_num, positive = "1")
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 2332 405
## 1 601 1841
##
## Accuracy : 0.8058
## 95% CI : (0.7947, 0.8165)
## No Information Rate : 0.5663
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6086
##
## Mcnemar's Test P-Value : 7.845e-10
##
## Sensitivity : 0.8197
## Specificity : 0.7951
## Pos Pred Value : 0.7539
## Neg Pred Value : 0.8520
## Prevalence : 0.4337
## Detection Rate : 0.3555
## Detection Prevalence : 0.4715
## Balanced Accuracy : 0.8074
##
## 'Positive' Class : 1
##
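As an aside, the manual inverse-logit transformation used above is equivalent to asking `predict()` for probabilities directly; a quick sketch, assuming `model_final` and `test_re` exist as above:

```r
# type = "response" returns probabilities on the response scale,
# so the two approaches should agree up to numerical precision
prob_direct <- predict(model_final, test_re, type = "response")
all.equal(unname(prob_direct), test_re$prob)
```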
# ROC curves from the LDA and step-wise regression models in the same graph
rocplot_lda <- roc(predictions, labels) # from LDA
rocplot_re <- roc(test_re$prob, as.factor(test_re$satisfaction)) # from step-wise model
plot(rocplot_lda, col = 'red', lty = 1, main = 'ROC')
plot(rocplot_re, col = '#3F7ED5', lty = 2, add = TRUE)
legend(0.8, 0.15, legend=c("LDA", "Step-wise"),
col=c("red", "#3F7ED5"), lty=1:2)
text(0.3, 0.70, paste0("LDA AUC = ", round(auc(rocplot_lda), 5)), col = "red", cex = 1.0)
text(0.3, 0.60, paste0("Step-wise Regression AUC = ", round(auc(rocplot_re), 5)), col = "#3F7ED5", cex = 1.0)
As we can see from the plot, the AUC values from the LDA and step-wise regression models are similar (about 0.87), which indicates that both methods perform similarly well in terms of their ability to distinguish between customers feeling satisfied and those feeling neutral or dissatisfied.
To further explore the relationship between the rated features and the different ratings, we carry out correspondence analysis on the 10 variables that were rated.
We will first create a contingency table with the 10 rated features as columns and the 6 rating levels (0 - 5) as rows.
# create a contingency table (rows = rating levels, columns = rated features)
names <- colnames(df1_CA)
c_matrix <- matrix(0, nrow = 6, ncol = 10)
for (i in 1:10){
for (j in 1:6){
c_matrix[j,i] <- table(df1_CA[names[i]])[j]
}
}
# convert table from matrix to dataframe
ctable <- data.frame(c_matrix, check.names = FALSE)
# rename rows (rating from 0 - 5)
row.names(ctable) <- 0:5
# rename columns (10 rated features)
colnames(ctable) <- colnames(df1_CA)
# contingency table
ctable
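The same contingency table can also be built without the explicit double loop; a sketch, assuming every column of `df1_CA` is coded on the 0 - 5 scale:

```r
# tabulate each rated feature against the fixed set of rating levels;
# factor() guarantees all six levels appear even when a level has zero counts
ctable2 <- sapply(df1_CA, function(x) table(factor(x, levels = 0:5)))
ctable2 <- as.data.frame(ctable2)
```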
## Inflight.wifi.service Departure.Arrival.time.convenient
## 0 812 1374
## 1 4469 3899
## 2 6481 4336
## 3 6298 4399
## 4 4965 6312
## 5 2868 5573
## Ease.of.Online.booking Food.and.drink Online.boarding Inflight.entertainment
## 0 1193 25 651 4
## 1 4342 3210 2558 3193
## 2 6021 5375 4417 4318
## 3 5927 5474 5296 4725
## 4 4854 6183 7682 7347
## 5 3556 5626 5289 6306
## On.board.service Leg.room.service Inflight.service Cleanliness
## 0 2 126 2 2
## 1 2906 2536 1775 3404
## 2 3658 5000 2838 3968
## 3 5690 4940 5005 6046
## 4 7814 7075 9352 6771
## 5 5823 6216 6921 5702
A quick chi-square test conducted on the contingency table yielded a p-value of approximately 0, indicating that the row variables (ratings) have statistically significant associations with the column variables (rated features).
# chi-square test
chisq <- chisq.test(ctable)
chisq
##
## Pearson's Chi-squared test
##
## data: ctable
## X-squared = 16298, df = 45, p-value < 2.2e-16
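To see which cells drive this association, the Pearson residuals stored in the chi-square test object can be inspected; a sketch (the `corrplot` call assumes that package is installed):

```r
# cells with large positive residuals are over-represented relative to independence,
# cells with large negative residuals are under-represented
round(chisq$residuals, 2)
# optional visualisation of the residuals:
# corrplot::corrplot(chisq$residuals, is.corr = FALSE)
```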
# CA
res.ca.air <- CA(ctable, graph = FALSE)
summary(res.ca.air)
##
## Call:
## CA(X = ctable, graph = FALSE)
##
## The chi square of independence between the two variables is equal to 16297.54 (p-value = 0 ).
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Variance 0.047 0.011 0.002 0.002 0.000
## % of var. 74.523 18.147 3.904 2.944 0.482
## Cumulative % of var. 74.523 92.670 96.574 99.518 100.000
##
## Rows
## Iner*1000 Dim.1 ctr cos2 Dim.2
## 0 | 24.460 | 1.056 38.479 0.738 | 0.627
## 1 | 7.662 | 0.224 13.380 0.819 | -0.050
## 2 | 9.049 | 0.186 13.284 0.689 | -0.099
## 3 | 2.457 | 0.040 0.696 0.133 | -0.074
## 4 | 9.297 | -0.171 16.526 0.834 | 0.062
## 5 | 10.016 | -0.199 17.635 0.826 | 0.062
## ctr cos2 Dim.3 ctr cos2
## 0 55.770 0.260 | -0.031 0.639 0.001 |
## 1 2.741 0.041 | 0.022 2.459 0.008 |
## 2 15.488 0.195 | 0.043 13.188 0.036 |
## 3 10.039 0.467 | -0.064 35.085 0.351 |
## 4 8.949 0.110 | -0.037 14.481 0.038 |
## 5 7.013 0.080 | 0.063 34.148 0.084 |
##
## Columns
## Iner*1000 Dim.1 ctr cos2 Dim.2
## Inflight.wifi.service | 13.211 | 0.338 24.354 0.865 | -0.123
## Departure.Arrival.time.convenient | 9.887 | 0.199 8.438 0.400 | 0.234
## Ease.of.Online.booking | 13.423 | 0.366 28.506 0.996 | 0.005
## Food.and.drink | 2.167 | -0.038 0.315 0.068 | -0.127
## Online.boarding | 1.493 | -0.014 0.039 0.012 | 0.092
## Inflight.entertainment | 2.736 | -0.143 4.375 0.750 | -0.032
## On.board.service | 3.278 | -0.167 5.970 0.854 | -0.030
## Leg.room.service | 2.136 | -0.111 2.620 0.575 | -0.031
## Inflight.service | 12.201 | -0.331 23.317 0.896 | 0.098
## Cleanliness | 2.408 | -0.098 2.065 0.402 | -0.086
## ctr cos2 Dim.3 ctr cos2
## Inflight.wifi.service 13.166 0.114 | -0.045 8.085 0.015 |
## Departure.Arrival.time.convenient 47.945 0.554 | 0.052 10.818 0.027 |
## Ease.of.Online.booking 0.026 0.000 | -0.015 0.965 0.002 |
## Food.and.drink 14.165 0.747 | 0.059 14.295 0.162 |
## Online.boarding 7.385 0.565 | -0.049 9.957 0.164 |
## Inflight.entertainment 0.906 0.038 | 0.062 15.850 0.142 |
## On.board.service 0.787 0.027 | -0.050 10.180 0.076 |
## Leg.room.service 0.842 0.045 | 0.063 16.312 0.188 |
## Inflight.service 8.328 0.078 | -0.052 10.948 0.022 |
## Cleanliness 6.451 0.306 | -0.025 2.589 0.026 |
Dimension 1 (74.52%) and dimension 2 (18.15%) add up to about 92.7%. This means the two dimensions capture most of the information in the dataset.
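The eigenvalue breakdown can also be visualised as a scree plot, using the same factoextra helpers employed for the biplot; a sketch assuming `res.ca.air` exists as above:

```r
# percentage of variance explained by each CA dimension
fviz_screeplot(res.ca.air, addlabels = TRUE)
```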
Overall, most rated features are grouped together, with the exception of Departure.Arrival.time.convenient. This could mean that Departure.Arrival.time.convenient is less strongly associated with the other features, showing little similarity with the rest of the variables. Inflight.wifi.service and Ease.of.Online.booking form their own small cluster. The variables that are the least dissimilar are Ease.of.Online.booking and Online.boarding.
Rating 0 is very far away from all rated features, suggesting that rating 0 is highly discriminated from the other ratings and rated features. This is reasonable, since 0 means the services were not available on the flights for passengers to rate.
There is a division between ratings 1 - 3 in the upper right quadrant of the plot and ratings 4 - 5 in the lower left quadrant. This means the relationship between ratings and rated features is not uniformly positive or negative, and may indicate subgroups within the ratings or features.
Now let’s examine the relationship between the column variables, which are the services, and the row variables, which are the ratings.
We can see that rating 2 and Inflight.wifi.service are both distant from the origin, meaning Inflight.wifi.service is associated with the low rating of 2; in other words, the inflight wifi service has mostly been rated poorly.
Similarly, Ease.of.Online.booking and rating 1 are far from the origin and the angle between them is small, suggesting that passengers did not find booking online easy. Inflight.service has mostly been rated highly, whereas most of the service variables in the clusters sit somewhere between neutral and positive.
# biplot
fviz_ca_biplot(res.ca.air, repel = TRUE)
In dimension 1, rating 0 is the largest contributor to the total variance amongst the row variables. However, it should be noted that 0 in our dataset does not indicate the lowest level of satisfaction, but that particular services were not offered on the flights and were therefore not applicable for rating. Amongst the column variables, Ease.of.Online.booking, Inflight.wifi.service and Inflight.service are the largest contributors.
# contribution of axis 1 - row variables
fviz_contrib(res.ca.air, choice="row", axes=1)
# contribution of axis 1 - column variables
fviz_contrib(res.ca.air, choice="col", axes=1)
In dimension 2, once again, rating 0 captures the largest amount of variance, along with Departure.Arrival.time.convenient.
# contribution of axis 2 - row variables
fviz_contrib(res.ca.air, choice="row", axes=2)
# contribution of axis 2 - column variables
fviz_contrib(res.ca.air, choice="col", axes=2)
To further explore the prediction power of the statistical tools we have utilized so far, we will compare the following models: the LDA model, the step-wise regression model, a logistic regression model built on the principal components only, and a logistic regression model using all numerical and categorical variables.
# convert PCs into a dataframe
df_pca <- as.data.frame(airline.pca$x)
# add the response variable to the dataframe
df_pca["satisfaction"] <- df$satisfaction
# numerically encode the response variable
df_pca$satisfaction <- ifelse(df_pca$satisfaction=="satisfied", 1, 0)
# a dataframe of only PCs and the response variable
head(df_pca)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -2.7088943 1.0833865 1.31527211 -1.4199776 0.13475759 0.2693778
## 2 -1.6103765 -2.4837470 -0.44022622 0.3543749 -0.86799607 0.1264887
## 3 2.8409125 -0.4185596 -0.05629095 0.7029403 -0.01498055 -0.5642194
## 4 3.4419198 -2.6545624 -3.19878268 0.6617597 -0.93195332 -0.2853510
## 5 2.1482181 0.2402589 -1.72739473 0.3664105 0.94506445 0.4722932
## 6 -0.9118258 -0.2400422 -2.51014836 0.6044318 -0.51284227 0.2438900
## PC7 PC8 PC9 PC10 PC11 PC12
## 1 -0.2329317 0.0166523 -0.21475732 -0.005031209 0.74897571 -1.03958767
## 2 0.1124948 0.1234445 -0.01161041 -0.062345268 -0.53491978 -0.03727264
## 3 -1.3645888 -1.8495066 -0.91635691 -0.877084449 0.09403787 -0.13529958
## 4 0.4647119 -1.1738962 0.59454692 0.028274160 -0.27491805 -0.56730652
## 5 -0.7558237 0.7884415 -0.88083144 -0.179415316 0.04975187 -0.60585078
## 6 0.3087657 -0.2262218 -1.90000231 0.326358145 -0.52122419 -1.33555626
## PC13 PC14 PC15 satisfaction
## 1 0.90557801 0.12487081 -0.124477572 1
## 2 -1.33761503 0.50009863 0.003429964 1
## 3 0.02900768 0.09296906 -0.019406409 0
## 4 0.13917492 0.61890248 0.096009497 1
## 5 -1.13850144 0.53133554 0.360029351 1
## 6 0.30752308 -0.70375654 -0.022490007 1
# train/test split (80/20)
sample_pca <- sample.int(n = nrow(df_pca), size = floor(.8*nrow(df_pca)), replace = F) # 80% of dataset
train_pca <- df_pca[sample_pca, ]
test_pca <- df_pca[-sample_pca, ]
# fit model on train set
model_train_pca <- glm(satisfaction ~ ., family = binomial(link = "logit"), data = train_pca)
# use test set to predict
test_pca$fitted_pca <- predict(model_train_pca, test_pca)
# calculate the predicted probabilities
test_pca$prob_pca <- exp(test_pca$fitted_pca)/(1+exp(test_pca$fitted_pca))
# the classification report: sensitivity, specificity and accuracy
prob_cut_off_pca = sum(test_pca$satisfaction)/nrow(test_pca)
test_pca$predict <- as.numeric(test_pca$prob_pca > prob_cut_off_pca)
xtab_pca <- table(test_pca$predict, test_pca$satisfaction)
caret::confusionMatrix(xtab_pca, positive = "1")
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 2325 408
## 1 608 1838
##
## Accuracy : 0.8038
## 95% CI : (0.7927, 0.8146)
## No Information Rate : 0.5663
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6047
##
## Mcnemar's Test P-Value : 4.287e-10
##
## Sensitivity : 0.8183
## Specificity : 0.7927
## Pos Pred Value : 0.7514
## Neg Pred Value : 0.8507
## Prevalence : 0.4337
## Detection Rate : 0.3549
## Detection Prevalence : 0.4723
## Balanced Accuracy : 0.8055
##
## 'Positive' Class : 1
##
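Since the leading principal components capture most of the variance, a natural follow-up is to check how much predictive power survives when only the first few PCs are kept. A minimal sketch, assuming the objects above exist; the choice of 8 components is purely illustrative:

```r
# refit the logistic model on a subset of the principal components
model_pca8 <- glm(satisfaction ~ ., family = binomial(link = "logit"),
                  data = train_pca[, c(paste0("PC", 1:8), "satisfaction")])
prob_pca8 <- predict(model_pca8, test_pca, type = "response")
# accuracy at the same prevalence-based cut-off as before
pred_pca8 <- as.numeric(prob_pca8 > prob_cut_off_pca)
mean(pred_pca8 == test_pca$satisfaction)
```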
# convert categorical variables into numerical
all_df$satisfaction <- ifelse(all_df$satisfaction=="satisfied", 1, 0)
all_df$Class <- as.numeric(as.factor(all_df$Class))
all_df$Customer.Type <- ifelse(all_df$Customer.Type =="Loyal Customer", 1, 0)
all_df$Gender <- ifelse(all_df$Gender=="Male", 1, 0)
all_df$Type.of.Travel <- ifelse(all_df$Type.of.Travel =="Personal Travel", 1, 0)
head(all_df,2)
## Gender Customer.Type Age Type.of.Travel Class Flight.Distance
## 1 0 1 52 0 2 160
## 2 0 1 36 0 1 2863
## Inflight.wifi.service Departure.Arrival.time.convenient
## 1 5 4
## 2 1 1
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 4 3 4
## 2 3 1 5 4
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 3 5 5 5
## 2 5 4 4 4
## Baggage.handling Checkin.service Inflight.service Cleanliness
## 1 5 2 5 5
## 2 4 3 4 5
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
## 1 50 44 1
## 2 0 0 1
# train/test split (80/20)
sample_cat <- sample.int(n = nrow(all_df), size = floor(.8*nrow(all_df)), replace = F) # 80% of dataset
train_cat <- all_df[sample_cat, ]
test_cat <- all_df[-sample_cat, ]
# fit model on train set
model_cat <- glm(satisfaction ~ ., family = binomial(link = "logit"), data = train_cat)
# use test set to predict
test_cat$fitted <- predict(model_cat, test_cat)
# calculate the predicted probabilities
test_cat$prob <- exp(test_cat$fitted)/(1+exp(test_cat$fitted))
# the classification report: sensitivity, specificity and accuracy
prob_cut_off_cat = sum(test_cat$satisfaction)/nrow(test_cat)
test_cat$predict <- as.numeric(test_cat$prob > prob_cut_off_cat)
xtab_cat <- table(test_cat$predict, test_cat$satisfaction)
caret::confusionMatrix(xtab_cat, positive = "1")
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 2533 321
## 1 360 1965
##
## Accuracy : 0.8685
## 95% CI : (0.859, 0.8776)
## No Information Rate : 0.5586
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7338
##
## Mcnemar's Test P-Value : 0.1453
##
## Sensitivity : 0.8596
## Specificity : 0.8756
## Pos Pred Value : 0.8452
## Neg Pred Value : 0.8875
## Prevalence : 0.4414
## Detection Rate : 0.3794
## Detection Prevalence : 0.4489
## Balanced Accuracy : 0.8676
##
## 'Positive' Class : 1
##
When comparing the prediction results from the LDA model, the logistic regression model with only principal components and the logistic regression model with all numerical and categorical variables, we can see that all three models perform similarly well. Specifically, they achieved accuracies of 80.4 - 86.8%, sensitivities of 78.2 - 84.6% and specificities of 79.3 - 88.5%. The logistic regression model with all numerical and categorical variables has the highest accuracy, sensitivity and specificity, although its improvement over the other models is modest.
We can use Cronbach’s alpha coefficient to assess internal consistency, which measures the degree to which items in a factor are correlated with each other. Cronbach’s alpha ranges from 0 to 1, where higher values indicate greater internal consistency. A coefficient higher than 0.6 is generally considered acceptable for most research purposes.
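In symbols, for a group of $k$ items, Cronbach’s alpha is defined as

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{Y_i}}{\sigma^2_X}\right),$$

where $\sigma^2_{Y_i}$ is the variance of item $i$ and $\sigma^2_X$ is the variance of the total score formed by summing the $k$ items. When the items are highly correlated, the item variances are small relative to the total-score variance and $\alpha$ approaches 1.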
Except for the group “Others”, the Cronbach’s alpha coefficients are higher than 0.6, indicating a moderate to high level of internal consistency among the items in each group.
Since the group “Others” consists of only one item, Online.boarding, Cronbach’s alpha is not applicable to it.
# group the data
In_flight_experience <- subset(df, select = c(Inflight.entertainment, Cleanliness, Seat.comfort, Food.and.drink))
Convenience <- subset(df, select = c(Ease.of.Online.booking, Inflight.wifi.service, Gate.location, Departure.Arrival.time.convenient))
Service <- subset(df, select = c(Baggage.handling, Inflight.service, On.board.service, Leg.room.service))
Punctuality <- subset(df, select = c(Departure.Delay.in.Minutes, Arrival.Delay.in.Minutes))
# Cronbach's alpha coefficient
cronbach.alpha(In_flight_experience)
##
## Cronbach's alpha for the 'In_flight_experience' data-set
##
## Items: 4
## Sample units: 25893
## alpha: 0.879
cronbach.alpha(Convenience)
##
## Cronbach's alpha for the 'Convenience' data-set
##
## Items: 4
## Sample units: 25893
## alpha: 0.772
cronbach.alpha(Service)
##
## Cronbach's alpha for the 'Service' data-set
##
## Items: 4
## Sample units: 25893
## alpha: 0.777
cronbach.alpha(Punctuality)
##
## Cronbach's alpha for the 'Punctuality' data-set
##
## Items: 2
## Sample units: 25893
## alpha: 0.982
This project employed Principal Component Analysis (PCA) and Factor Analysis (FA) to examine the underlying structure of the US Airline Passengers dataset and to detect features that are strongly linked to the passengers’ satisfaction level. The results showed that, while there are some differences in the grouping between PCA and FA, the interval features can generally be categorized into 5 groups: in-flight experience, convenience, service, punctuality and others.
To assess the reliability of the FA grouping, Cronbach’s alpha was calculated for each group created by the FA procedure. Coefficient values higher than 0.6 confirmed a relatively high internal consistency between the items in each group.
Linear Discriminant Analysis (LDA) was also applied, using a train/test split of the dataset, to find linear combinations of variables that maximize the separation between groups and then to predict passengers’ satisfaction level. The outcomes indicated an accuracy of 81.35%, and an AUC value of 0.87 confirmed the relatively good performance of the prediction model.
A model built from step-wise regression was also studied and compared with the LDA model. In terms of accuracy, the step-wise regression model performed as well as the LDA model, achieving an accuracy of 80.38% and the same AUC value of 0.87.
To delve deeper into the relationship between the rated features and the different ratings, correspondence analysis was conducted. Approximately 92.67% of the total variance is captured by the first 2 dimensions, with rating 0 (service not applicable for rating) being the largest contributor in both dimensions.
The performance of the LDA model, the step-wise regression model, the logistic regression model with only the principal components and the logistic regression model with all original variables is comparable.